84 ◾ Bioinformatics
cd ref
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.
fa.gz
gunzip -d hg38.fa.gz
samtools faidx hg38.fa
bowtie2-build --thread 4 hg38.fa hg
cd ..
mkdir sam
bowtie2 -x \
ref/hg \
-1 data/SRR769545_1.fastq.gz \
-2 data/SRR769545_2.fastq.gz \
-S sam/SRR769545.sam
This time we have chosen to download the UCSC human reference genomes because the
chromosome names are given instead of accession numbers.
Once the SAM file has been created, we can convert it into a BAM file, sort it, and index
it using Samtools.
samtools view -uS -o sam/SRR769545.bam sam/SRR769545.sam
samtools sort -@ 4 -T bowtie2.tmp.sort \
-o sam/SRR769545_sorted.bam sam/SRR769545.bam
samtools index sam/SRR769545_sorted.bam
Then, we can use bcftools to detect the variants like variant substitutions and write that
information into a binary variant call format (BCF), which is the binary form of vari-
ant call format (VCF). This file includes the difference in genotype between the reference
genome and the genome of the individual sequenced. The variant discovery and VCF file
format will be discussed in detail in Chapter 4. However, they have to be used here to show
how the reference-guided genome assembly is performed. The bcftools can be installed in
Linux as follows:
sudo apt install bcftools
Now, we can use the “bcftools mpileup” command to collapse the pileup of the reads
aligned to the reference genomes in the BAM file and then to write only the variant infor-
mation into a VCF file. In the following, the variant information is written into a BCF file
and then that binary file is converted into a VCF file:
bcftools mpileup -f ref/hg38.fa \
sam/SRR769545_sorted.bam \
| bcftools call -mv -Ob \
-o sam/SRR769545.bcf
bcftools convert -O v \
-o SRR769545.vcf \
SRR769545.bcf